1 Abstract

The purpose of this analysis is to identify physical & chemical properties affecting white wine quality.

2 Dataset

The dataset contains quality rankings from three wine-tasting experts together with details of the chemical composition of 4898 white wine samples.

This dataset was made public by:

P. Cortez, A. Cerdeira, F. Almeida, T. Matos and J. Reis. Modeling wine preferences by data mining from physicochemical properties. In Decision Support Systems, Elsevier, 47(4):547-553. ISSN: 0167-9236.

Available at: [@Elsevier] http://dx.doi.org/10.1016/j.dss.2009.05.016 [Pre-press (pdf)] http://www3.dsi.uminho.pt/pcortez/winequality09.pdf [bib] http://www3.dsi.uminho.pt/pcortez/dss09.bib

3 Exploratory Analysis

3.1 Dataset Structure Investigation

Dataset dimensions

## [1] 4898   13

Dataset content

## 'data.frame':    4898 obs. of  13 variables:
##  $ X                   : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ fixed.acidity       : num  7 6.3 8.1 7.2 7.2 8.1 6.2 7 6.3 8.1 ...
##  $ volatile.acidity    : num  0.27 0.3 0.28 0.23 0.23 0.28 0.32 0.27 0.3 0.22 ...
##  $ citric.acid         : num  0.36 0.34 0.4 0.32 0.32 0.4 0.16 0.36 0.34 0.43 ...
##  $ residual.sugar      : num  20.7 1.6 6.9 8.5 8.5 6.9 7 20.7 1.6 1.5 ...
##  $ chlorides           : num  0.045 0.049 0.05 0.058 0.058 0.05 0.045 0.045 0.049 0.044 ...
##  $ free.sulfur.dioxide : num  45 14 30 47 47 30 30 45 14 28 ...
##  $ total.sulfur.dioxide: num  170 132 97 186 186 97 136 170 132 129 ...
##  $ density             : num  1.001 0.994 0.995 0.996 0.996 ...
##  $ pH                  : num  3 3.3 3.26 3.19 3.19 3.26 3.18 3 3.3 3.22 ...
##  $ sulphates           : num  0.45 0.49 0.44 0.4 0.4 0.44 0.47 0.45 0.49 0.45 ...
##  $ alcohol             : num  8.8 9.5 10.1 9.9 9.9 10.1 9.6 8.8 9.5 11 ...
##  $ quality             : int  6 6 6 6 6 6 6 6 6 6 ...

The dataset includes 13 columns (1x index, 11x input variables, 1x output attribute)

We will drop index column X

New dataset dimensions

## [1] 4898   12
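A minimal sketch of this step, assuming the data frame is named `ww` (the name used in the `summary()` call later in this report). The three-row data frame below is a made-up stand-in so the snippet runs without the CSV file:

```r
# Toy stand-in for the wine data; values are illustrative only.
ww <- data.frame(
  X = 1:3,
  alcohol = c(8.8, 9.5, 10.1),
  quality = c(6L, 6L, 6L)
)

# Drop the index column X; all other columns are kept.
ww$X <- NULL

dim(ww)
```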

3.2 Univariate Analysis

3.2.1 Univariate Plots

Density plot of all variables & attributes

Looking into each variable in isolation from other attributes.

3.2.2 Analyzing Input Variables

summary(subset(ww, select = -c(quality)))
##  fixed.acidity    volatile.acidity  citric.acid     residual.sugar  
##  Min.   : 3.800   Min.   :0.0800   Min.   :0.0000   Min.   : 0.600  
##  1st Qu.: 6.300   1st Qu.:0.2100   1st Qu.:0.2700   1st Qu.: 1.700  
##  Median : 6.800   Median :0.2600   Median :0.3200   Median : 5.200  
##  Mean   : 6.855   Mean   :0.2782   Mean   :0.3342   Mean   : 6.391  
##  3rd Qu.: 7.300   3rd Qu.:0.3200   3rd Qu.:0.3900   3rd Qu.: 9.900  
##  Max.   :14.200   Max.   :1.1000   Max.   :1.6600   Max.   :65.800  
##    chlorides       free.sulfur.dioxide total.sulfur.dioxide
##  Min.   :0.00900   Min.   :  2.00      Min.   :  9.0       
##  1st Qu.:0.03600   1st Qu.: 23.00      1st Qu.:108.0       
##  Median :0.04300   Median : 34.00      Median :134.0       
##  Mean   :0.04577   Mean   : 35.31      Mean   :138.4       
##  3rd Qu.:0.05000   3rd Qu.: 46.00      3rd Qu.:167.0       
##  Max.   :0.34600   Max.   :289.00      Max.   :440.0       
##     density             pH          sulphates         alcohol     
##  Min.   :0.9871   Min.   :2.720   Min.   :0.2200   Min.   : 8.00  
##  1st Qu.:0.9917   1st Qu.:3.090   1st Qu.:0.4100   1st Qu.: 9.50  
##  Median :0.9937   Median :3.180   Median :0.4700   Median :10.40  
##  Mean   :0.9940   Mean   :3.188   Mean   :0.4898   Mean   :10.51  
##  3rd Qu.:0.9961   3rd Qu.:3.280   3rd Qu.:0.5500   3rd Qu.:11.40  
##  Max.   :1.0390   Max.   :3.820   Max.   :1.0800   Max.   :14.20

1 - fixed acidity (tartaric acid - \(g / dm^3\))
Accounts for nonvolatile acids in wine; approximately normally distributed around 6.8 \(g/dm^3\)

2 - volatile acidity (acetic acid - \(g / dm^3\))
Accounts for the vinegar taste in wine and follows a right-skewed unimodal distribution

3 - citric acid (\(g / dm^3\))
Adds ‘freshness’ to the wine, found in small quantities. Follows a right skewed unimodal distribution

4 - residual sugar (\(g / dm^3\))
Right skewed distribution with high peak around the lower edge of IQR (at 1.7 \(g/dm^3\))

5 - chlorides (sodium chloride - \(g / dm^3\))
Accounts for the salty taste in wine; right skewed, with 75% of samples at or below 0.05 \(g/dm^3\)

6 - free sulfur dioxide (\(mg / dm^3\))
Right-skewed unimodal distribution with a median of 34.00 \(mg/dm^3\)

7 - total sulfur dioxide (\(mg / dm^3\))
Right-skewed unimodal distribution with a median of 134.0 \(mg/dm^3\)

8 - density (\(g / cm^3\))
Density follows an approximately normal distribution with mean at 0.994

9 - pH (0 most acid, 7 neutral, 14 most base)
pH is a continuous scale representing acidity, where 0 is the most acidic, 7 is neutral, and 14 is the most basic. Analysis shows an approximately normal distribution of pH around 3.2 (median 3.18) with IQR ~0.2

10 - sulphates (potassium sulphate - \(g / dm^3\))
Sulphates are a wine additive that acts as an antimicrobial and antioxidant. The distribution is skewed to the right.

11 - alcohol (% by volume)
Alcohol follows a right-skewed distribution with an IQR from 9.50% to 11.40% by volume. Yet the distribution shape is almost bimodal (two peaks), with another peak around 12.5% by volume.
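The "right skewed" descriptions above are read off the density plots. One way to back them with a number is a moment-based sample skewness, sketched below. The helper name `sample_skewness` and the two example vectors are illustrative assumptions, not part of the original analysis:

```r
# Moment-based sample skewness: positive values indicate a right skew.
sample_skewness <- function(x) {
  x <- x[!is.na(x)]
  centered <- x - mean(x)
  mean(centered^3) / (mean(centered^2)^1.5)
}

# Illustrative values only (not the wine data).
right_tailed <- c(1, 1, 2, 2, 2, 3, 3, 4, 10, 20)  # long right tail
symmetric <- c(-2, -1, 0, 1, 2)                    # no skew

sample_skewness(right_tailed)  # positive
sample_skewness(symmetric)     # zero
```

Applied to a column such as `ww$residual.sugar`, a clearly positive value would support the right-skew reading.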

3.2.3 Analyzing Output Attributes

12 - quality (score between 0 and 10)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   3.000   5.000   6.000   5.878   6.000   9.000

The quality distribution appears unimodal and approximately normal, centered at 6, with most wines graded 5, 6, or 7.

Create a new categorical variable quality.grp from quality using the following slices:

  • Bad : Wine quality rating less than or equal to 5
  • Good : Wine quality rating above 5 and up to 6 (that is, exactly 6)
  • Great : Wine quality rating above 6

##   Bad  Good Great 
##  1640  2198  1060
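The grouping into Bad, Good, and Great can be sketched with `cut()`. The helper name `grade_quality` and the example scores are assumptions for illustration; the report does not show how `quality.grp` was actually built:

```r
# Hypothetical helper: map integer quality scores to the three grades.
grade_quality <- function(quality) {
  cut(quality,
      breaks = c(-Inf, 5, 6, Inf),   # (-Inf, 5], (5, 6], (6, Inf)
      labels = c("Bad", "Good", "Great"))
}

scores <- c(3, 5, 6, 7, 9)           # illustrative scores only
table(grade_quality(scores))
```

With the real data this would be used as `ww$quality.grp <- grade_quality(ww$quality)`.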

3.2.4 Univariate Analysis Summary

  • About 78% of the samples (3838 of 4898) have a quality value less than or equal to 6, which is both the median and the third quartile.
    • Only about 22% of the samples (1060 of 4898) are classified as Great.
  • Alcohol, pH, Total Sulfur Dioxide, & Sulphates seem to have a large IQR relative to their overall range.

  • Residual Sugar, Chlorides, Density, Free Sulfur Dioxide, & Citric Acid seem to have multiple outliers (\(>2\mu\)) in comparison to other input parameters.

  • A bivariate analysis is required to identify possible correlations between parameters.

3.3 Bivariate Analysis

3.3.1 Correlation Analysis

We will start the bivariate analysis by identifying the correlations between the parameters in the dataset.

We run the significance test at \(\alpha = 0.05\) and omit statistically insignificant results from the correlation graph.

The graph above shows the following significant correlations

  • Strong positive correlation between Density and Residual Sugar

  • Strong negative correlation between Alcohol and Density

  • Medium positive correlation between

    • Quality and Alcohol

    • Density and Total Sulfur Dioxide

  • Medium negative correlation between

    • Quality and Density

    • Alcohol and each of (Residual Sugar, Total Sulfur Dioxide, Chlorides, Free Sulfur Dioxide)
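The masking of non-significant correlations can be sketched in base R with `cor.test()`. This is not necessarily how the correlation graph in this report was produced: the helper name `significant_cor` is hypothetical, and the data frame is synthetic so the snippet runs on its own.

```r
# Hypothetical helper: correlation matrix with non-significant cells set to NA.
significant_cor <- function(df, alpha = 0.05) {
  vars <- names(df)
  r <- cor(df)
  for (i in seq_along(vars)) {
    for (j in seq_along(vars)) {
      if (i != j) {
        p <- cor.test(df[[i]], df[[j]])$p.value
        if (p > alpha) r[i, j] <- NA   # hide results that fail the test
      }
    }
  }
  r
}

# Synthetic data (not the wine data): y depends on x, z is independent noise.
set.seed(1)
x <- rnorm(200)
toy <- data.frame(x = x, y = 2 * x + rnorm(200, sd = 0.5), z = rnorm(200))
round(significant_cor(toy), 2)
```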


The following table highlights correlation values between Quality & other variables in the dataset.

##              Attributes QualityCorrelation AbsQualityCorrelation IQR
## 1               alcohol              0.436                 0.436  Q4
## 2               density             -0.307                 0.307  Q4
## 3             chlorides             -0.210                 0.210  Q4
## 4      volatile.acidity             -0.195                 0.195  Q3
## 5  total.sulfur.dioxide             -0.175                 0.175  Q3
## 6         fixed.acidity             -0.114                 0.114  Q2
## 7                    pH              0.099                 0.099  Q2
## 8        residual.sugar             -0.098                 0.098  Q2
## 9             sulphates              0.054                 0.054  Q1
## 10          citric.acid             -0.009                 0.009  Q1
## 11  free.sulfur.dioxide              0.008                 0.008  Q1
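The report does not show how the Q1 to Q4 labels in the last column were assigned. A plausible rule, sketched here as an assumption, is to split the absolute correlations at their own quartiles. The values below are copied from the table above:

```r
# Absolute correlations with quality, copied from the table above.
abs_r <- c(alcohol = 0.436, density = 0.307, chlorides = 0.210,
           volatile.acidity = 0.195, total.sulfur.dioxide = 0.175,
           fixed.acidity = 0.114, pH = 0.099, residual.sugar = 0.098,
           sulphates = 0.054, citric.acid = 0.009,
           free.sulfur.dioxide = 0.008)

# Assumed binning rule: split |r| at its own quartiles.
bins <- cut(abs_r,
            breaks = quantile(abs_r, probs = seq(0, 1, 0.25)),
            labels = c("Q1", "Q2", "Q3", "Q4"),
            include.lowest = TRUE)

data.frame(Attributes = names(abs_r), AbsQualityCorrelation = abs_r,
           Bin = bins, row.names = NULL)
```

Under this rule the three attributes with the largest absolute correlation (alcohol, density, chlorides) fall in Q4, which agrees with the labels in the table.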

Based on the above correlation analysis:

  • Citric Acid, Free Sulfur Dioxide, and Sulphates are essentially uncorrelated with wine quality (absolute r below 0.06).
  • Residual Sugar, pH, and Fixed Acidity are very weakly correlated with wine quality.
  • Total Sulfur Dioxide and Volatile Acidity are weakly correlated with wine quality (weaker than the correlations observed in Q4).
  • Alcohol, Density, and Chlorides are the variables most strongly correlated with wine quality, although even the largest of these (Alcohol, r = 0.436) is only a moderate correlation.

Hence, we will

  • Drop Citric Acid, Free Sulfur Dioxide, Sulphates, pH, and Fixed Acidity from the rest of this analysis.
  • Keep Residual Sugar, despite its weak correlation with wine quality, because of its strong correlation with Density.
  • Focus the study on Alcohol, Density, Chlorides, Total Sulfur Dioxide, Volatile Acidity, and Residual Sugar, and their relationship with Wine Quality.

3.3.2 Bivariate Plots

We will start by plotting all variables against each other, mapped to the Wine Quality grade (Bad, Good, Great).

We will use one boxplot per wine grade to examine the descriptive statistics of the different variables more closely.

It is clear from the above box plot that multiple variables have outliers that may cloud our conclusions. Hence, we will restrict each variable individually to values at or below its 95th percentile.

We will NOT drop any data from the dataset at this point; instead, we will limit the analysis window to the 95th percentile to suppress outlier noise.
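A minimal sketch of such an analysis window: the original data frame is left untouched, and a filtered view is built per variable. The helper name `within_quantile` and the toy values are assumptions for illustration:

```r
# Hypothetical helper: keep rows at or below the given quantile of one column.
within_quantile <- function(df, column, prob = 0.95) {
  cutoff <- quantile(df[[column]], probs = prob, na.rm = TRUE)
  df[!is.na(df[[column]]) & df[[column]] <= cutoff, , drop = FALSE]
}

# Illustrative values only: 99 ordinary readings plus one extreme outlier.
toy <- data.frame(residual.sugar = c(seq(1, 10, length.out = 99), 65.8))
trimmed <- within_quantile(toy, "residual.sugar")

nrow(toy)                      # the original data is left untouched
max(trimmed$residual.sugar)    # the extreme value is outside the window
```

Because the filter returns a new data frame, each plot can use its own window (for example, one per variable) without altering `ww` itself.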

In the next few sections, we will analyze Quality as the main output variable and build a deeper understanding of the interdependencies between the other variables.

3.3.2.1 Quality

Plotting Quality vs. other wine attributes while maintaining color-code for wine grade (Bad, Good, Great) to identify if there is any pattern associated with great wines.

The box plots above show that wine quality rises with alcohol content (r = 0.436, a moderate positive correlation). The linear model plotted in orange shows quality increasing roughly linearly with alcohol content.

The box plots also show the negative correlation between wine quality and Density, Chlorides, and Volatile Acidity, although the many outliers in those plots make the visual association harder to see.

Hence, we will plot the distribution of the same parameters per wine grade, restricted to the lowest 95% to 98% of the values of each parameter, so that the top 2% to 5% are excluded as outliers.

Density distribution of the key quality parameters, color-coded by wine grade

We can infer from the above plot that,

A. Great quality wines have the highest median Alcohol level.

B. Great quality wines have the lowest median Density.

C. Great quality wines tend to have lower Chlorides levels.

D. Great quality wines tend to have lower Volatile Acidity levels.

3.3.2.2 Alcohol

Alcohol has a strong negative correlation with Density, and a weaker negative correlation with Residual Sugar, Total Sulfur Dioxide, and Chlorides.

3.3.2.3 Density

The dataset used in the following plot has been modified to remove outliers above the 95th percentile of the attributes under analysis.

Analysis shows that

  • Density is strongly correlated with Residual Sugar at r = 0.84

  • Density is moderately correlated with Total Sulfur Dioxide at r = 0.53

3.3.2.4 Residual Sugar & Total Sulfur Dioxide

The dataset used in the following plot has been modified to remove outliers above the 95th percentile of the attributes under analysis.

The previous plot shows a weak correlation between Residual Sugar and Total Sulfur Dioxide, with a correlation coefficient of r = 0.4.

3.3.3 Bivariate Analysis Summary

With regard to Wine Quality

  • Great quality wines have the highest median Alcohol levels and the lowest median Density levels.

  • Great quality wines tend to have lower Chlorides & Volatile Acidity levels.


We have also noticed the following strong correlations in the dataset

  • Alcohol is strongly correlated with Density, with a correlation coefficient of r = -0.78.

  • Density is strongly correlated with Residual Sugar, with a correlation coefficient of r = 0.84.

3.4 Multivariate Analysis

3.4.1 Multivariate Plots

The dataset used in the following plots has been modified to remove outliers above the 95th percentile of the Density attribute.

3.4.1.1 Quality interdependency on Alcohol vs. Density

Based on the findings of the bivariate analysis, in this section we will try to understand the relationship between wine quality and its most strongly correlated variables in a multivariate setting.

The diagram above suggests that, for the same Density level, great quality wines tend to have higher Alcohol content.

Let’s isolate the great quality wines in their own graph to see if this statement still holds.

As the above plot shows, within the Great category, ratings tend to be higher at higher alcohol levels for the same density range.

On the other hand, within the Bad category, ratings tend to be very low at higher alcohol levels. It looks like a second-order variable is in play here.

Hence, we will analyze Quality against those second-order variables.

3.4.1.2 Quality interdependency on Second order variables vs. Alcohol

We can infer from the above diagram that:

  • For the same Alcohol content, great quality wines tend to have a lower Residual Sugar range.
  • Increased levels of Chlorides occur more often in lower grades of wine.

We can increase the plot contrast by dropping the middle grade (Good) from the dataset and replotting the graphs.

It is clear from the above plot that,

  • For the same Alcohol content, great quality wines tend to have lower Residual Sugar value & range.
  • High levels of Chlorides exist only in bad quality wines.
  • For the same Volatile Acidity level, great quality wines tend to have higher Alcohol content.
  • Samples with high Volatile Acidity or high Residual Sugar tend to have lower than average Alcohol levels.

3.4.1.3 Quality interdependency on Second order variables vs. Density

It is hard to draw conclusions from the above plot, so let’s drop the middle grade (Good) from the dataset and try again.

We can infer from the above plot that,

  • Great quality wines tend to have lower Density.
  • For the same Density level, great quality wines tend to have higher Residual Sugar value.
  • Great quality wines tend to have a lower Total Sulfur Dioxide range compared to bad quality wines.

3.4.1.4 Quality interdependency on Residual Sugar & Total Sulfur Dioxide

The dataset used in the following plot has been modified to remove outliers above the 95th percentile of the Total Sulfur Dioxide & Residual Sugar attributes.

It has also been modified to remove the Good quality subgroup for better contrast.

We can infer from the above plot that, for the same Residual Sugar level, great quality wines tend to have lower Total Sulfur Dioxide content.

3.4.2 Multivariate Analysis Summary

We can summarize the key conclusions we gathered from the multivariate analysis into the following list.

  • Higher grade wines are the ones with the highest level of Alcohol & lowest Density.
  • For the same Alcohol content, great quality wines tend to have lower value & range of Residual Sugar.
  • For the same Volatile Acidity level, great quality wines tend to have higher Alcohol content.
  • For the same Density level, higher quality wines tend to have higher Alcohol content.
  • For the same Density level, great quality wines tend to have higher Residual Sugar value.
  • For the same Residual Sugar level, great quality wines tend to have lower Total Sulfur Dioxide content.
  • High levels of Chlorides exist only in bad quality wines.
  • Samples with high Volatile Acidity or high Residual Sugar tend to have lower than average Alcohol levels.
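Statements of the form "for the same Density level" can be checked numerically by binning Density and comparing a statistic per grade within each bin. The sketch below uses synthetic data in which Great wines are given a higher Alcohol level by construction, so it only illustrates the technique and says nothing about the real dataset:

```r
# Synthetic stand-in data (not the wine data); the +1 alcohol offset for
# Great wines is an assumption built in purely for illustration.
set.seed(42)
n <- 400
toy <- data.frame(density = runif(n, 0.990, 0.998),
                  grade = sample(c("Bad", "Great"), n, replace = TRUE))
toy$alcohol <- 10.5 - 200 * (toy$density - 0.994) +
  ifelse(toy$grade == "Great", 1, 0) + rnorm(n, sd = 0.3)

# Hold Density roughly constant by cutting it into equal-width bins.
toy$density.bin <- cut(toy$density, breaks = 4)

# Median Alcohol per Density bin and grade (rows: bins, columns: grades).
median_alcohol <- tapply(toy$alcohol,
                         list(toy$density.bin, toy$grade),
                         median)
round(median_alcohol, 2)
```

On the real data, the same table built from `ww` and `quality.grp` would show whether the Great column stays above the Bad column in every Density bin.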

4 Linear Regression

## mtable() below comes from the memisc package
library(memisc)

## Load fresh version of data
ww_lm <- read.csv('wineQualityWhites.csv')

m1 <- lm(quality ~ alcohol, data = ww_lm)
m2 <- update(m1, ~ . - alcohol + density)
m3 <- update(m2, ~ . + alcohol)
m4 <- update(m3, ~ . + fixed.acidity + volatile.acidity + citric.acid + residual.sugar + 
               chlorides + free.sulfur.dioxide + total.sulfur.dioxide + pH + sulphates)

mtable(m1, m2, m3, m4, sdigits = 3)
## 
## Calls:
## m1: lm(formula = quality ~ alcohol, data = ww_lm)
## m2: lm(formula = quality ~ density, data = ww_lm)
## m3: lm(formula = quality ~ density + alcohol, data = ww_lm)
## m4: lm(formula = quality ~ density + alcohol + fixed.acidity + volatile.acidity + 
##     citric.acid + residual.sugar + chlorides + free.sulfur.dioxide + 
##     total.sulfur.dioxide + pH + sulphates, data = ww_lm)
## 
## ================================================================================
##                              m1            m2            m3            m4       
## --------------------------------------------------------------------------------
##   (Intercept)               2.582***     96.277***    -22.492***    150.193***  
##                            (0.098)       (4.003)       (6.165)      (18.804)    
##   alcohol                   0.313***                    0.360***      0.193***  
##                            (0.009)                     (0.015)       (0.024)    
##   density                               -90.942***     24.728***   -150.284***  
##                                          (4.027)       (6.079)      (19.075)    
##   fixed.acidity                                                       0.066**   
##                                                                      (0.021)    
##   volatile.acidity                                                   -1.863***  
##                                                                      (0.114)    
##   citric.acid                                                         0.022     
##                                                                      (0.096)    
##   residual.sugar                                                      0.081***  
##                                                                      (0.008)    
##   chlorides                                                          -0.247     
##                                                                      (0.547)    
##   free.sulfur.dioxide                                                 0.004***  
##                                                                      (0.001)    
##   total.sulfur.dioxide                                               -0.000     
##                                                                      (0.000)    
##   pH                                                                  0.686***  
##                                                                      (0.105)    
##   sulphates                                                           0.631***  
##                                                                      (0.100)    
## --------------------------------------------------------------------------------
##   R-squared                 0.190         0.094         0.192         0.282     
##   adj. R-squared            0.190         0.094         0.192         0.280     
##   sigma                     0.797         0.843         0.796         0.751     
##   F                      1146.395       509.911       583.290       174.344     
##   p                         0.000         0.000         0.000         0.000     
##   Log-likelihood        -5839.391     -6111.983     -5831.127     -5543.740     
##   Deviance               3112.257      3478.689      3101.773      2758.329     
##   AIC                   11684.782     12229.967     11670.255     11113.480     
##   BIC                   11704.272     12249.456     11696.241     11197.936     
##   N                      4898          4898          4898          4898         
## ================================================================================
# fml1 <- as.formula(paste("quality", "~", paste(INDEPENDENT, collapse=' + ')))
#m1 <- lm(fml1, ww_lm)

Linear regression analysis has led to the following observations,

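  • ~19% of Quality variance is explained by Alcohol alone (m1, R-squared = 0.190).
  • ~9% of Quality variance is explained by Density alone (m2, R-squared = 0.094); adding Density to Alcohol (m3) raises R-squared only from 0.190 to 0.192.
  • ~28% of Quality variance is explained by all independent variables in the dataset together (m4, R-squared = 0.282).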
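The R-squared values quoted above can be pulled directly from fitted models instead of being read off the table. The sketch below assumes a hypothetical helper `r_squared` and uses synthetic data so that it runs without the CSV file; its numbers will not match the table above:

```r
# Hypothetical helper: collect R-squared for a named list of fitted lm models.
r_squared <- function(models) {
  sapply(models, function(m) summary(m)$r.squared)
}

# Synthetic stand-in data (not the wine data), used only to make this runnable.
set.seed(7)
n <- 500
toy <- data.frame(alcohol = rnorm(n, 10.5, 1.2))
toy$density <- 0.994 - 0.002 * (toy$alcohol - 10.5) + rnorm(n, sd = 0.001)
toy$quality <- 6 + 0.3 * (toy$alcohol - 10.5) + rnorm(n, sd = 0.8)

m1 <- lm(quality ~ alcohol, data = toy)
m2 <- lm(quality ~ density, data = toy)
m3 <- lm(quality ~ density + alcohol, data = toy)

round(r_squared(list(m1 = m1, m2 = m2, m3 = m3)), 3)
```

Because m1 and m2 are both nested in m3, the R-squared of m3 can never be lower than either of them; the useful question is how much it gains.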

Conclusion:

Even with all input variables, the linear model explains only ~28% of the variance in quality, so linear modeling is not sufficient to predict white wine quality.

5 Final Plots and Summary

5.0.1 White Wine physical & chemical attributes correlations

The correlation plot above omits results that are not statistically significant at \(\alpha = 0.05\).

The graph above shows the following significant correlations

  • Strong positive correlation between Density and Residual Sugar

  • Strong negative correlation between Alcohol and Density

  • Medium positive correlation between

    • Quality and Alcohol

    • Density and Total Sulfur Dioxide

  • Medium negative correlation between

    • Quality and Density

    • Alcohol and each of (Residual Sugar, Total Sulfur Dioxide, Chlorides, Free Sulfur Dioxide)

5.0.2 Quality vs. Alcohol & Density

Density distribution of Alcohol & Density, color-coded by wine grade

The plot above shows that great quality wines have the highest median Alcohol level and the lowest median Density.

5.0.3 Quality interdependency on second order variables vs. Alcohol

The plot above suggests that

  • For the same Alcohol content, great quality wines tend to have lower Residual Sugar value & range.

  • High levels of Chlorides exist only in bad quality wines.

  • For the same Volatile Acidity level, great quality wines tend to have higher Alcohol content.

  • Samples with high Volatile Acidity or high Residual Sugar tend to have lower than average Alcohol levels.


6 Reflection

The white wine dataset used in this analysis contains multiple physical & chemical attributes together with a rating of wine quality. Across the 4898 samples, linear regression failed to predict white wine quality well. Yet we noticed a moderate correlation between white wine quality and alcohol (r = 0.436), where about 19% of quality variance can be explained by change in alcohol content.